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Abstract 

Several biological problems require the identification of regions in a sequence where 
some feature occurs within a target density range: examples including the location of GC- 
rich regions, identification of CpG islands, and sequence matching. Mathematically, this 
corresponds to searching a string of Os and Is for a substring whose relative proportion 
of Is lies between given lower and upper bounds. We consider the algorithmic problem of 
locating the longest such substring, as well as other related problems (such as finding the 
shortest substring or a maximal set of disjoint substrings). For locating the longest such 
substring, we develop an algorithm that runs in 0{n) time, improving upon the previous 
best-known 0{nlogn) result. For the related problems we develop 0(71, log log n) algo- 
rithms, again improving upon the best-known 0{nlogn) results. Practical testing verifies 
that our new algorithms enjoy significantly smaller time and memory footprints, and can 
process sequences that are orders of magnitude longer as a result. 

AMS Classification Primary 68W32; Secondary 92D20 

Keywords Algorithms, string processing, substring density, bioinformatics 

1 Introduction 

In this paper we develop fast algorithms to search a sequence for regions in which a given feature 
appears within a certain density range. Such problems are common in biological sequence 
analysis; examples include: 

• Locating GC-rich regions, where G and C nucleotides appear with high frequency. GC- 
richness correlates with factors such as gene density |23], gene length |5], recombination 
rates [TU] , codon usage [5T] , and the increasing complexity of organisms [HJ [T3] . 

• Locating CpG islands, which have a high frequency of CpG dinucleotides. CpG islands are 
generally associated with promoters [T71 HD] , are useful landmarks for identifying genes 
[18], and play a role in cancer research [9]. 

• Sequence alignment, where we seek regions in which multiple sequences have a high rate 
of matches |23) . 

Further biological applications of such problems are outlined in [11] and [19] . Such problems 
also have applications in other fields, such as cryptography P| and image processing I12j. 

We represent a sequence as a string of Os and Is (where 1 indicates the presence of the feature 
that we seek, and indicates its absence). For instance, when locating GC-rich regions we let 1 
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and denote GC and TA pairs respectively. For any substring, we define its density to be the 
relative proportion of Is (which is a fraction between and 1). Our density constraint is the 
following: given bounds 9i and 62 with 9i < 02, we wish to locate substrings whose density lies 
between 61 and 62 inclusive. 

The specific values of the bounds 61,62 depend on the particular application. For instance, 
CpG islands can be divided into classes according to their relationships with transcriptional 
start sites [T7], and each class is found to have its own characteristic range of GC content. 
Likewise, isochores in the human genome can be classified into five families, each exhibiting 
different ranges of GC-richness [3l [24] . 

We consider three problems in this paper: 

(a) locating the longest substring with density in the given range; 

(b) locating the shortest substring with density in the given range, allowing optional con- 
straints on the substring length; 

(c) locating a maximal cardinality set of disjoint substrings whose densities all lie in the given 
range, again with optional length constraints. 

The prior state of the art for these problems is described by Hsieh et al. [Tl], who present 
O(nlogn) algorithms in all three cases. In this paper we improve the time complexities of 
these problems to 0{n), ©(n log log n) and ©(n log log n) respectively. In particular, our 0{n) 
algorithm for problem (a) has the fastest asymptotic complexity possible. 

Experimental testing on human genomic data verifies that our new algorithms run signif- 
icantly faster and require considerably less memory than the prior state of the art, and can 
process sequences that are orders of magnitude longer as a result. 

Hsieh et al. consider a more general setting for these problems: instead of Os and Is they 
consider strings of real numbers (whereupon "ratio of Is" becomes "average value"). In their 
setting they prove a theoretical lower bound of il{nlogn) time on all three problems. The key 
feature that allows us to break through their lower bound in this paper is the discrete (non- 
continuous) nature of structures such as DNA; in other words, the ability to represent them as 
strings over a finite alphabet. 

Many other problems related to feature density are studied in the literature. Examples 
include maximising density under a length constraint 1121 119] , finding all substrings under 
range of density and length constraints [Mj [15], finding the longest substring whose density 
matches a precise value [H [5], and one-sided variants of our first problem with a lower density 
bound 61 but no upper density bound ^2 [H [3 [6l [Ml . 

We devote the first half of this paper to our 0{n) algorithm for locating the longest substring 
with density in a given range: Section [2] develops the mathematical framework, and Sections [3| 
and [4| describe the algorithm and present performance testing. In Section [5| we adapt our 
techniques for the remaining two problems. 

All time and space complexities in this paper are based on the commonly-used word RAM 
model [71 §2.2], which is reasonable for modern computers. In essence, if n is the input size, we 
assume that each (logn)-bit integer takes constant space (it fits into a single word) and that 
simple arithmetical operations on (log n)-bit integers (words) take constant time. 
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2 Mathematical Framework 



We consider a string of digits zi , . . . , z„ , where each Zi is either or 1 . The length of a substring 
Zq, . . . , Zfc is defined to be i(a, b) — b — a + 1 (the number of digits it contains), and the density 
of a substring z^, . . . , z^ is defined to be D{a, h) — Y^i=a '^ilL{a, b) (the relative proportion of 
Is). It is clear that the density always lies in the range < D{a, b) < 1. 

Our first problem is to find the maximum length substring whose density lies in a given 
range. Formally: 

Problem 1. Given a string zi, . . . , z„ as described above and two rational numbers 9i = ci/di 
and 02 = C2/d2, compute 

max {L{a,b)\9i<D{a,b)<e2}. 

l<a<b<n 

We assume that < 9i < 62 < 1, that < di,d2 < n, and that gcd{ci,di) — gcd{c2,d2) = f . 

For example, if the input string is llOOOIOIOI (with n — 10) and the bounds are 9i — l/A 
and 62 — f/3 then the maximum length is 7. This is attained by the substring ll OOOIOIO I 
(a = 3 and 6 = 9), which has density D{3, 9) = 2/7 ~ 0.286. 

The additional assumptions in Problem [T] are harmless. If 6*1 = or = 1 then the 
problem reduces to a one-sided bound, for which simpler linear-time algorithms are already 
known [T] [6l [23] . If = 62 then the problem reduces to matching a precise density, for which a 
linear-time algorithm is also known 5 . If some 9i is irrational or if some di > n, we can adjust 
9i to a nearby rational for which di < n without affecting the solution. 

We consider two geometric representations, each of which describes the string zi, . . . , z„ as 
a path in two-dimensional space. The first is the natural representation, defined as follows. 

Definition 2. Given a string zi,...,z„ as described above, the natural representation is the 
sequence of n -f I points po, . . . , p„ where each has coordinates (fc, Xli=i ^fc)- 




■ ,0''' N H 

Z3, 24, . . . , Zg 
h 1 1 1 1 1 1 1 1 1 1 ► 



Figure I: The natural representation of the string llOOOIOIOI 

The X and y coordinates of p^ effectively measure the number of digits and the number of 
Is respectively in the prefix string zi, . . . , Zfe. See Figured] for an illustration. 

The natural representation is useful because densities have a clear geometric interpretation: 

Lemma 3. In the natural representation, the density D{a,b) is the gradient of the line segment 
joining Pa 1 with p^. 
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The proof follows directly from the definition of D{a, b). The shaded cone in Figure [T] shows 
how, for our example problem, the gradient of the line segment joining p2 with pg (i.e., the 
density D(3,9)) hes within our target range [^1,^2] = [1/4, 1/3]. 

Our second geometric representation is the orthogonal representation. Intuitively, this is 
obtained by shearing the previous diagram so that lines of slope 9i and 62 become horizontal 
and vertical respectively, as shown in Figured] Formally, we define it as follows. 



02 

o 

qio\ 




Figure 2: The orthogonal representation of the string 1100010101 for [6*1, 6^2] = [1/4, 1/3] 

Definition 4. Given a string zi,...,z„ and rational numbers 9i = ci/di and 62 = 22/^2 as 
described earlier, the orthogonal representation is the sequence oi n + 1 points qo, . . . , Qn, where 
each qfe has coordinates (c2fc — d2 —cik + di X^iLi ^i)- 

From this definition we obtain the following immediate result. 

Lemma 5. qo = (0,0), and for i > we have q^ — q^-i + (c2,— ci) if Zi = or q^ = 
qj-i + (c2 - d2,di - ci) if Zi = 1. 

The key advantage of the orthogonal representation is that densities in the target range 
[9i, 62] correspond to dominating points in our new coordinate system. Here we use a non-strict 
definition of domination: a point {x,y) is said to dominate {x' ,y') if and only if both x > x' 
and y > y' ■ 

Theorem 6. The density of the substring Za, ■ ■ ■ , Zb satisfies 61 < D{a, b) < 62 if and only if 
dominates qa-i. 

Proof. The difference q^ — qa-i has coordinates 

b b 

(c2(6 - a + 1) - d2^Zi, -ci{b - a + 1) + rfi ^ z^) 

i—a i—a 

= L{a,b) ■ {c2 - d2D{a,b), -ci + diD{a,b)) , 

which are both non- negative if and only if D{a, 6) < 02/^2 = 02 and D(a, b) > ci/di — 61. □ 

The shaded cone in Figure [2] shows how qg dominates q2 in our example, indicating that the 
substring Z3, . . . , zg has a density in the range [1/4, 1/3]. 
It follows that Problem [T] can be reinterpreted as: 
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Problem [TJ. Given the orthogonal representation qo, . . . , q,i as defined above, find points q^, q* 
for which (\t dominates q^ and t ^ s is as large as possible. 

The correspondmg substrmg that solves Problem [1] is Zs+i, • • • , -Zt- 

We finish this section with two properties of the orthogonal representation that are key to 
obtaining a linear time algorithm for this problem. 

Lemma 7. The coordinates of each point q^ are integers in the range [— n^,n^]. 

Proof. This follows directly from Lemma O qo = (0,0), and the coordinates of each subsequent 
qi are obtained by adding integers in the range [—n,n] to the coordinates of q^-i. □ 

Lemma 8. //qi dominates q^ then i> j. 

Proof. Consider the linear function /: — > M defined by f{x, y) — dix + d2y. It is clear from 
Lemma[5]that /(qo) = and /(qi) = /(qi-i) + diC2 — d2Ci. Since 9i — ci/di < 02/^2 = 62 it 
follows that /(qi) > /(qi-i)- 

Suppose qi dominates q^. By definition of / we have /(qi) > /(qj), and by the observation 
above it follows that i > j. □ 



3 Algorithm 

To solve Problem ^ we construct and then scan along the inner and outer frontiers, which we 
define as follows. 

Definition 9. Consider the orthogonal representation qo, . . . , q„ for the input string zi, . . . , Zn- 
The inner frontier is the set of points qfe that do not dominate any q.i for i ^ k. The outer 
frontier is the set of points q^ that are not dominated by any q.i for i 7^ fc. 




Figure 3: The inner and outer frontiers 

Figure |3] illustrates both of these sets. They are algorithmically important because of the 
following result. 

Lemma 10. If (\s and qt form a solution to ProblemTl^ then q^ lies on the inner frontier and 
c\t lies on the outer frontier. 

Proof. If q^ is not on the inner frontier then q^ dominates q^ for some i ^ s. By Lemma |S] we 
have i < s, which means that q^ and qt cannot solve Problem ^ since q* dominates qi and 
t ~ i > t ~ s. The argument for q^ on the outer frontier is similar. □ 
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3.1 Data structures 



The data structures that appear in this algorithm are simple. 

For each point = {xi,yi), we refer to i as the index of q^, and we store the point as a triple 
{i,Xi,yi). If t is such a triple, we refer to its three constituents as t.idx, t.x and t.y respectively. 

We make frequent use of lists of triples. If L is such a list, we refer to the first and last 
triples in L as L. first and L.last respectively, we denote the number of triples in L by L.size, 
and we denote the individual triples in L by L[0], L[l], . . . , L[L.size — 1]. All lists are assumed 
to have 0(1) insertion and deletion at the beginning and end, and 0{L.size) iteration through 
the elements in order from first to last (as provided, for example, by a doubly- linked list). 

For convenience we may write q^ € L to indicate that the triple describing q^ is contained 
in L; formally, this means (i,Xi,yi) € L. 

3.2 The two-phase radix sort 

The algorithm make use of a two-phase radix sort which, given a list of i integers in the range 
[0,6^), allows us to sort these integers in 0(£ + b) time and space. In brief, the two-phase radix 
sort operates as follows. 

Since the integers are in the range [0,6^), we can express each integer fc as a "two-digit 
number" in base 6; in other words, a pair (a, /3) where k = a + /3 ■ b and a, j3 are integers in the 
range Q < a, (3 < b. 

We create an array of 6 "buckets" (linked lists) in memory for each possible value of a. In a 
first pass, we use a counting sort to order the integers by increasing a (the least significant digit): 
this involves looping through the integers to place each integer in the bucket corresponding to 
the digit a (a total of £ distinct 0(1) list insertion operations), and then looping through the 
buckets to extract the integers in order of a (effectively concatenating b distinct lists with total 
length i). This first pass takes 0(^ + 6) time in total. 

We then make a second pass using a similar approach, using another 0{i -\-b) counting sort 
to order the integers by increasing j3 (the most significant digit). Importantly, the counting sort 
is stable and so the final result has the integers sorted by (3 and then a\ that is, in numerical 
order. The total running time is again 0{l + b), and since we have 6 buckets with a total of I 
elements, the space complexity is likewise 0{£ + b). 

In our application, we need to sort a list of n -f 1 integers in the range [— n^, n^]; this can be 
translated into the setting above with £ = n + 1 and b = 2n, and so the two-phase radix sort 
has 0(n) time and space complexity. 

This is a specific case of the more general radix sort; for further details the reader is referred 
to a standard algorithms text such as [7j. 

3.3 Constructing frontiers 

The first stage in solving ProblemfTDis to construct the inner and outer frontiers in sorted order, 
which we do efficiently as follows. The corresponding pseudocode is given in Figured 

Algorithm 11. To construct the inner frontier / and the outer frontier O, both in order by 
increasing x coordinate: 

1. Build a list L of triples corresponding to all n -I- 1 points qo, . . . , qn, using Lemma [5l Sort 
this list by increasing x coordinate using a two-phase radix sort as described above, noting 
that the sort keys Xi are all integers in the range [—n^,n'^] (Lemma [7]). 
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1: [(0,0,0)] 

2: for i ^ 1 to n do 

3: if Zi = then 

4: Append L.last + (1, C2, —ci) to the end of L 

5: else 

6: Append L.last + (1, C2 — d2,di ~ ci) to the end of L 

7: Sort L by increasing a; using a two-phase radix sort 

8: / ^ [L.first] 

9: for all £ G L, moving forward through L do 
10: if £.y < I. last. y then 

11: if £.x = I. last. X then > l.last dominates £ 

12: Remove the last triple from / 

13: Append £ to the end of / 

14: O ^ [L.last] 

15: for all £ E L, moving backwards through L do 
16: if £.y > O. first. y then 

17: if £.x — O. first. X then o £ dominates O. first 

18: Remove the first triple from O 

19: Prepend £ to the beginning of O 



Figure 4: The pseudocode for Algorithm [TT] 

2. Initialise / to the one-element list [L.first]. Step through L in forward order (from left to 
right in the diagram); for each triple £ E L that has lower y than any triple seen before, 
append £ to the end of /. 

3. Construct O in a similar fashion, working through L in reverse order (from right to left). 

In step [21 there is a complication if we append a new triple £ to I for which £.x = Llast.x. 
Here we must first remove Llast since £ makes it obsolete. See lines [TTHT^] of Figure 0] for the 
details. 

Table [T] shows a worked example for step [5] of the algorithm, i.e., the construction of the 
inner frontier. The points in this example correspond to Figure |31 and each row of the table 
shows how the frontier / is updated when processing the next triple £ d L (for simplicity we 
only show the coordinate pairs {xi,yi) from each triple). Note that, although L is sorted by 
increasing x coordinate, for each fixed x coordinate the corresponding y coordinates may appear 
in arbitrary order. 

Theorem 12. Alaorithm \ll\ constructs the inner and outer frontiers in I and O respectively, 
with each list sorted by increasing x coordinate, in 0{n) time and 0{n) space. 

Proof. This algorithm is based on a well-known method for constructing frontiers. We show 
here why the inner frontier / is constructed correctly; a similar argument applies to the outer 
frontier O. 
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Coordinates {xi,yi) 
from the triple £ £ L 


Current inner frontier I 


(-10,15) 


[ (-10,15) 1 






(-8, 12) 


[ (-10,15), 


(-8,12)] 




(-8, 8) 


[ (-10,15) ] 








[ (-10,15), 


(-8,8) ] 




(-6, 9) 


[ (-10,15), 


(-8,8) ] 




(-6, 7) 


[ (-10,15), 


(-8,8) (-6,7) ] 




(-6, 11) 


[ (-10,15), 


(-8,8) (-6,7) ] 




(-4, 6) 


[ (-10,15), 


(-8,8) (-6,7) (- 


-4,6)] 


(-4, 8) 


[ (-10,15), 


(-8,8) (-6,7) (- 


-4,6) ] 


(-4, 4) 


[ (-10,15), 


(-8,8) (-6,7) ] 






[ (-10,15), 


(-8,8) (-6,7) (- 


-4,4)] 


(-2, 5) 


[ (-10,15), 


(-8,8) (-6,7) (- 


-4,4)] 


(-0, 0) 


[ (-10,15), 


(-8,8) (-6,7) (- 


-4,4) (0,0)] 



Table 1: Constructing the inner frontier 



If a triple £ L with coordinates {xi,yi) does belong on the inner frontier (i.e., there is 
no other point {xj,yj) in the list that it dominates), then we are guaranteed to add it to / in 
step m because the only triples processed thus far have x < Xi, and must therefore have y > yi. 
Moreover, we will not subsequently remove £ from / again, since the only other triples with the 
same x coordinate must have y > yi. 

If a triple £ E L with coordinates {xi,yi) does not belong on the inner frontier, then there is 
some point {xj, yj) that it dominates. If Xj < Xi then we never add £ to /, since by the time we 
process I we will already have seen the coordinate yj (which is at least as low as yi). Otherwise 
Xj = Xi and yj < yi, and so we either add and then remove £ from / or else never add it at all, 
depending on the order in which we process the points with x coordinate equal to Xi. Either 
way, £ does not appear in the final list /. 

Therefore the list / contains precisely the inner frontier; moreover, since we process the 
points by increasing x coordinate, the list I will be sorted by x accordingly. 

The main innovation in this algorithm is the use of a radix sort with two-digit keys, made 
possible by Lemma [3 which allow us to avoid the usual O{n\ogn) cost of sorting. The two- 
phase radix sort runs in 0{n) time and space, as do the subsequent list operations in Steps 2-4, 
and so the entire algorithm runs in 0{n) time and space as claimed. □ 

3.4 Sliding windows 

The second stage in solving Problem ^ is to simultaneously scan through the inner and outer 
frontiers in search of possible solutions Qs G / and G O. 

We do this by trying each in order on the inner frontier, and maintaining a sliding window 
W of possible points qt ; specifically, W consists of all points on the outer frontier that dominate 
qs. Figure [5] illustrates this window W as qs moves along the inner frontier from left to right. 

For each point q^ G / that we process, it is easy to update W in amortised 0(1) time using 
sliding window techniques (pushing new points onto the end of the list W as they enter the 
window, and removing old points from the beginning as they exit the window). However, we 
still need a fast way of locating the point qt e O that dominates q^ and for which t — s is largest. 
Equivalently, we need a fast way of choosing the triple w £ W that maximises w.idx. 
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To do this, we maintain a sub-list M C W: this is a hst consisting of ah triples w £W that 
are potential maxima. Specifically, for any triple w € W, we include w in AI if and only if there 
is no w' € W for which w' .x > w.x and w' .idx > w.idx. The rationale is that, if there were 
such a w' , we would always choose w' over w in this or any subsequent window. 

As with all of our lists, we keep M sorted by increasing x coordinate. Note that the condition 
above implies that M is also sorted by decreasing index. In particular, the sought-after triple 
w € W that maximises w.idx is simply AI. first, which we can access in 0(1) time. 

Crucially, we can also update the sub-list AI in amortised 0(1) time for each point G / 
that we process. As a result, this sub- list M allows us to maximise w.idx for w £ W whilst 
avoiding a costly linear scan through the entire window W. 

The details are as follows; see Figure |6] for the pseudocode. 

Algorithm 13. Let the inner and outer frontiers be stored in the lists / and O in order by 
increasing x coordinate, as generated by Algorithm [TT] We solve Problem^ as follows. 

1. Initialise AI to the empty list. 

2. Step through the inner frontier / in forward order. For each triple inner £ I: 

(a) Process new points that enter our sliding window. To do this, we scan through any 
new triples outer e O for which outer. y > inner. y and update AI accordingly. 

Each new outer G O that we process has outer. x > m.x for all m G M, so we append 
outer to the end of AI. However, before doing this we must remove any m € AI 
for which m.idx < outer. idx (since such triples would violate the definition of AI). 
Because AI is sorted by decreasing index, all such m e Af can be found at the end 
of M. See lines [5H9] of Figured 

(b) Remove points from AI that have exited our sliding window. That is, remove triples 
m £ M for which m.x < inner. x. 

Because AI is sorted by increasing x coordinate, all such triples can be found at the 
beginning of AI. See lines [TUHTT] of Figure [Bl 

(c) Update the solution. The best solution to Problem fTTl that uses the triple inner g / 
is the pair of points dinner. idx , <lM. first. idx ■ If the difference M. first. idx — inner. idx 
exceeds any seen so far, record this as the new best solution. 

Theorem 14. Alaorithms M 1\ and \13\ together solve ProblemUll in 0{n) time and 0{n) space. 
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1 


M ^ [] 


> empty list 


2 


Sbost ^ 0, tbost ^0 > 


best solution so far 


3 


next ^0 > next element of O to scan through 


4 


for all inner € I, moving forward through / do 




5 


while next < O.size and 0[next].y > inner. y do 




6 


next -s— next + 1 




7 


while M.size > and 0[next].idx > M.last.idx do 




8 


Remove the last triple from M 




9 


Append 0[next] to the end of M 




10 


while M.size > and M. first. x < inner. x do 




11 


Remove the first triple from M 




12 


if M. first. idx — inner. idx > ibcst — Sbost then 




13 


Sbost ^ inner. idx 




14 


^bcst "5" M. first. idx 




Figure 6: The pseudocode for Algorithm [T3l 



Proof. Theorem [T^] analyses Algorithm 1111 and the preceding discussion shows the correctness 
of Algorithm [131 All that remains is to verify that Algorithm [T3l runs in 0{n) time and space. 

Each triple t e O is added to M at most once and removed from M at most once, and so 
the while loops on lines [5l iTl and ITOl each require total 0{n) time as measured across the entire 
algorithm. Finally, the outermost for loop (line 4) iterates at most n+l times, giving an overall 
running time for Algorithm 1131 of 0{n). 

Each of the lists /, O and AI contains at most n+l elements, and so the space complexity 
is 0{n) also. □ 



4 Performance 

Here we experimentally compare our new algorithm against the prior state of the art, namely 
the O{n\ogn) algorithm of Hsieh et al. [H]. Our trials involve searching for GC-rich regions in 
the human genome assembly GRCh37.p2 from GenBank [2j[T6]. The implementation that we 
use for our new algorithm is available onlinefl] and the code for the prior O{n\ogn) algorithm 
was downloaded from the respective authors' websiteH Both implementations are written in 
C/C++. 

Figure [7] measures running times for 24 x 4 x 3 = 288 instances of Problem [T] we begin 
with 24 human chromosomes (1-22, X and Y), extract initial strings of four different lengths n 
(ranging from n — 100 000 to n = 3 000 000), and search each for the longest substring whose 
GC-density is constrained according to one of three different ranges [01,6*2]. 

These ranges are: [0.6326, 0.7428], which matches the first CpG island class of loshikhes 
and Zhang [17 ; [0.69905, 0.69915], which surrounds the median of this class and measures 
performance for a narrow density range; and [2/3, 3/4], which measures performance when the 
key parameters di and d2 are very small. 

^For C++ implementations of all algorithms in this paper, visit http://www.maths.uq.edu.au/~bab/code/. 
■^The implementation of the prior algorithm ,14. is taken from |http : //venus ■ cs .n thu. edu . tw/- eric/FIF .htm[ 
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Figure 7: Performance comparisons on genomic data 



The results are extremely pleasing: in every case the new algorithm runs at least 10 x faster 
than the prior state of the art, and in some cases up to 42x faster. Of course such comparisons 
cannot be exact or fair, since the two implementations are written by different authors; however, 
they do illustrate that the new algorithm is not just asymptotically faster in theory (as proven 
in Theorem [H)) . but also fast in practice (i.e., the constants are not so large as to eliminate the 
theoretical benefits for reasonable inputs). 

The results for the range [2/3, 3/4] highlight how our algorithm benefits from small denom- 
inators (in which the range of possible x and y coordinates becomes much smaller). 

Memory becomes a significant problem when dealing with very large data sets. The algo- 
rithm of Hsieh et al. |14[ uses "heavy" data structures with large memory requirements: for 
n = 3 000 000 it uses 1.64 GB of memory. In contrast, our new algorithm has a much smaller 
footprint — ^just 70 MB for the same n — and can thereby process values of n that are orders of 
magnitude larger. In Figure [S] we run our algorithm over the full length of each chromosome; 
even the worst case (chromosome 1) with n = 249 250 621 runs for all density ranges in under 
85 seconds, using 5.6 GB of memory. 
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Figure 8: Performance of the new algorithm on full-length chromosomes 
All trials were run on a single 3 GHz Intel Core 17 CPU. Input and output are included in 
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running times, though detailed measurements show this to be insignificant for both algorithms 
(which share the same input and output routines). 

5 Related Problems 

The techniques described in this paper extend beyond Problem [TJ Here we examine two related 
problems from the bioinformatics literature, and for each we outline new algorithms that improve 
upon the prior state of the art. As usual, all problems take an input string zi, . . . , z„ where 
each is or 1. 

The new algorithms in this section rely on van Emde Boas trees [22j . a tree-based data 
structure for which many elementary key-based operations have ©(n log log n) time complexity. 
We briefly review this data structure before presenting the two related problems and the new 
algorithms to solve them. 

5.1 van Emde Boas trees 

Here we briefly recall the essential ideas behind van Emde Boas trees. For full details we refer 
the author to a modern textbook on algorithms such as [7]. 

A van Emde Boas tree is a data structure that implements an associative array (mapping keys 
to values), in which the user can perform several elementary operations in ©(logm) time, where 
m is the number of bits in the key. These elementary operations include inserting or deleting 
a key-value pair, looking up the value stored for a given key, and looking up the successor or 
predecessor of a given key k (i.e., the first key higher or lower than k respectively). In our case, 
all keys are in the range [— (Lemma[7l), and so m € O(logn^) = 0{\ogn); that is, these 
elementary operations run in ©(log log n) time. 

The core idea of this data structure is that each node of the tree represents a range of p 
consecutive possible keys for some p, and has ^ children (each a smaller van Emde Boas tree) 
that each represent a sub-range of possible keys. Each node also maintains the minimum and 
maximum keys that are actually present within its range. The root node of the tree represents 
the complete range of 2™ possible keys. 

Furthermore, for each node V of the tree representing a range of p possible keys, we also 
maintain an auxiliary van Emde Boas tree that stores which of the children of V are non- 
empty (i.e., have at least one key stored within them). 

To look up the successor of a given key k we travel down the tree, and each time we reach 
some node Vi that represents pi potential keys, we identify which of the ^/pl children contains 
the successor of k by examining the auxiliary tree attached to Vi. This induces a query on the 
auxiliary tree (representing ^/pi potential non-empty child trees) followed by a query on the 
selected child (representing potential keys), and so the running time follows a recurrence 
of the form T{m) — 2T(m/2) -I- 0(1); solving this recurrence yields the overall 0{log{m)) time 
complexity. The running times for inserting and deleting keys follow a similar argument, and 
again we refer the reader to a text such as 7 for the details. 

5.2 Shortest substring in a density range 

The first related problem that we consider is a natural counterpart to Problem [TJ instead of 
searching for the longest substring under given density constraints, we search for the shortest. 
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Problem 15. Find the shortest substring whose density lies in a given range. That is, given 
rationals 9i < O2, compute 

mill {L{a,b)\9i<D{a,b)<92}. 

l<a<b<n 

The best known algorithm for this problem runs in 0{n\ogn) time [14j : here we improve 
this to O(nloglogn). 

By Theorem [6] and Lemma [8l this is equivalent to finding points qs 7^ qt for which qt 
dominates qs and for which i — s is as small as possible. To do this, we iterate through 
each possible endpoint q* in turn, and maintain a partial outer frontier P consisting of all 
non-dominated points amongst the previous points {qo, qi, . . . , qt-i}; that is, all points q^ 
{0 < i < t — 1) that are not dominated by some other q^ {i < j < t — 1). When examining 
a candidate endpoint q* , it is straightforward to show that any optimal solution q^ , qt must 
satisfy q^. € P. 

Algorithm 16. To solve Problem [TSl 

1. Initialise P to the empty list. 

2. For each t = 0, . . . , n in turn, try qt as a possible endpoint: 

(a) Update the solution. To do this, walk through all points q^ G P that are dominated 
by qt. If the difference t — s is smaller than any seen so far, record this as the new 
best solution. 

(b) Update the partial frontier. To do this, remove all q^ € P that are dominated by qt, 
and then insert qt into P. 

The key to a fast time complexity is choosing an efficient data structure for storing the 
partial outer frontier P. We keep P sorted by increasing x coordinate, and for the underlying 
data structure we use a van Emde Boas tree |22j . 

To walk through all points qs G P that are dominated by qt in step 2a, we locate the 
point p G P with smallest x coordinate larger than xt] a van Emde Boas tree can do this in 
©(loglogn) time. The dominated points qs can then be found immediately prior to p in the 
partial frontier. Adding and removing points in step 2b is likewise ©(log log n) time, and the 
overall space complexity of a van Emde Boas tree can be made 0{n) |7J. 

A core requirement of this data structure is that keys in the tree can be described by integers 
in the range 0, . . . , n. To arrange this, we pre-sort the x coordinates qo.a;, . . . , q„.a; using an 
0{n) two-phase radix sort (as described in Section [3. 2p and then replace each x coordinate with 
its corresponding rank. 

To finalise the time and space complexities, we observe that each point qs G P that is 
processed in step 2a is immediately removed in step 2b, and so no point is processed more than 
once. Combined with the preceding discussion, this gives: 

Theorem 17. Alaorithm \16] runs in ©(nloglogn) time and uses 0{n) space. 

We can add an optional length constraint to Problem 1151 given rationals 9i < 92 and 
length bounds Li < L2, find the shortest substring Za, . . . , Zb for which 9i < D{a, b) < 02 and 
Li < L{a,b) < L2. This is a simple modification to Algorithm \W[ we redefine the partial 
frontier P to be the set of all non-dominated points amongst {qo, qi, . . . , qt-Li}- The update 
procedure changes slightly, but the ©(nloglogn) running time remains. 
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5.3 Maximal collection of substrings in a density range 

The second related problem involves searching for substrings "in bulk" : instead of finding the 
longest substring under given density constraints, we find the most disjoint substrings. 

Problem 18. Find a maximum cardinality set of disjoint substrings whose densities all lie in 
a given range. That is, given rationals Oi < 02, find substrings (^ai, • • • , -Zbi), (2^021 • ■ • i ^62)^ • • • ; 
(Zflfc, • • ■ ,Zb^) where 9i < D{ai,bi) < 62 and bi < Oj+i for each i, and where k is as large as 
possible. 

As before, the best known algorithm runs in 0{n\ogn) time |14j : again we improve this 
bound to 0{n log log n) . 

For this problem we mirror the greedy approach of Hsieh et al. [H]. One can show that, 
if Za, . . . , Zf, is a substring of density 9i < D{a, b) < 62 with minimum endpoint b, then some 
optimal solution to Problem [TH] has bi — b (i.e., we can choose Za, ■ . ■ , Zf, as our first substring). 
See [131 Lemma 6]. 

Our strategy is to use our previous Algorithm [T6l to locate such a substring Za, . . . , Zf,, store 
this as part of our solution, and then rerun our algorithm on the leftover n — b input digits 
Zf,+i, . . . , z„. We repeat this process until no suitable substring can be found. 

Algorithm 19. To solve Problem [T8l 

1. Initialise i ^ 1. 

2. Run Algorithm 1161 on the input string z^, . . . ,z„, but terminate Algorithm 1161 as soon as 
any dominating pair , qi is found. 

3. If such a pair is found: add the corresponding substring z^+i, . . . , Zf to our solution set, 
set i -s— < + 1, and return to stcp[5J Otherwise terminate this algorithm. 

It is important to reuse the same van Emde Boas tree on each run through Algorithm [16] 
(simply empty out the tree each time), so that the total initialisation cost remains 0{n log log n). 

Theorem 20. Algorithm ] 19\ runs in O{nlog\ogn) time and requires 0{n) space. 

Proof. Running Algorithm [TBI in step [2] takes 0{[t — i] log log n) time, since it only examines 
points qi, . . . , qt before the dominating pair is found. The total running time of Algorithm 1191 
is therefore 0{bi log log n + [62 — ^i] log log n + . . . + [fe^ — bk-i] log log n) = ©(n log log n). The 
0{n) space complexity follows from Theorem [T71 plus the observation that the final solution set 
can contain at most n disjoint substrings. □ 

As before, it is simple to add a length constraint to Problem [TSl so that each substring 
Zai,...,Zh. must satisfy both 61 < D{ai,bi) < 62 and Li < L{ai,bi) < L2 for some length 
bounds Li < L2. We simply incorporate the length constraint into Algorithm [T6l as described 
in Section [5T2l and nothing else needs to change. 



5.4 Performance 



As before, we have implemented both Algorithms [16] and [19] and tested each against the prior 
state of the art using human genomic data, following the same procedures as described in 
Section [3] Our van Emde Boas tree implementation is based on the MIT- licensed libveb by Jani 



Lahtinen, available from http://code.google.eom/p/libveb/ 
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For the shortest substrmg problem, our algorithm runs at 12-62 times the speed of the prior 
algorithm |14| , and requires under 1 /50th of the memory. For finding a maximal collection of 
substrings, our algorithm runs at 0.94-13 times the speed of the prior algorithm [T3], and uses 
less than l/20th of the memorylf] 

Once again, these improvements — particularly for memory usage — allow us to run our algo- 
rithms with significantly larger values of n than for the prior state of the art. For chromosome 1 
with n = 249 250 621, the two algorithms run in 221 and 176 seconds respectively, both using 
approximately 5.6 GB of memory. 

6 Discussion 

In this paper we consider three problems involving the identification of regions in a sequence 
where some feature occurs within a given density range. For all three problems we develop new 
algorithms that offer significant performance improvements over the prior state of the art. Such 
improvements are critical for disciplines such as bioinformatics that work with extremely large 
data sets. 

The key to these new algorithms is the ability to exploit discreteness in the input data. All 
of the applications we consider (GC-rich regions, CpG islands and sequence alignment) can be 
framed in terms of sequences of Is and Os. Such discrete representations are powerful; through 
Lemma [21 they allow us to perform 0{n) two-phase radix sorts, and to reindex coordinates by 
rank for van Emde Boas trees. In this way, discreteness allows us to circumvent the theoretical 
fl{n\ogn) bounds of Hsieh et al. [14] . which are only proven for the more general continuous 
(non-discrete) setting. 

In a discrete setting, an obvious lower bound for all three problems is J7(n) time (which is 
required to read the input sequence) . For the first problem (longest substring with density in a 
given range), we attain this best possible lower bound with our 0{n) algorithm. This partially 
answers a question of Chen and Chao who ask in a more general setting whether such an 
algorithm is possible. 

For the second and third problems (shortest substring and maximal collection of substrings) , 
although our ©(n log log n) algorithms have smaller time complexity and better practical per- 
formance than the prior state of the art, there is still room for improvement before we reach the 
theoretical n{n) lower bound (which may or may not be possible). Further research into these 
questions may prove fruitful. 

The techniques we develop here have applications beyond those discussed in this paper. 
For example, consider the problem of finding the longest substring whose density matches a 
precise value 6. This close relative of Problem [1] has cryptographic applications [4]. An 0{n) 
algorithm is known [S], but it requires a complex linked data structure. By adapting and 
simplifying Algorithms [TT] and [T51 for the case 9i = 02 ~ 0, we obtain a new 0{n) algorithm 
with comparable performance and a much simpler implementation. 

This last point raises the question of whether our algorithms for the second and third prob- 
lems can likewise be simplified to use only simple array-based sorts and scans instead of the 
more complex van Emde Boas trees. Further research in this direction may yield new practical 
improvements for these algorithms. 

^The few cases in which our algorithm was slightly slower (down to 0.94 times the speed) all involved large 
denominators di , d2 and maximal collections involving a very large number of very short substrings. 
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